Network Inference with Hidden Units 



Joanna Tyrcha 

Mathematical Statistics, Stockholm University, S106 91 Stockholm, Sweden 

j oanna@math .su.se 

John Hertz 

Niels Bohr Institute, Blegdamsvej 17, 2100 Copenhagen 0, Denmark 
NORDITA, Roslagstullsbacken 23, 10691 Stockholm, Sweden 
hertz@nordita . dk 



Abstract 

We derive learning rules for finding the connections between units in stochas- 
tic dynamical networks from the recorded history of a "visible" subset of the 
units. We consider two models. In both of them, the visible units are bi- 
nary and stochastic. In one model the "hidden" units are continuous- valued, 
with sigmoidal activation functions, and in the other they are binary and 
stochastic like the visible ones. We derive exact learning rules for both cases. 
For the stochastic case, performing the exact calculation requires, in general, 
repeated summations over an number of configurations that grows exponen- 
tially with the size of the system and the data length, which is not feasible 
for large systems. We derive a mean field theory, based on a factorized ansatz 
for the distribution of hidden-unit states, which offers an attractive alterna- 
tive for large systems. We present the results of some numerical calculations 
that illustrate key features of the two models and, for the stochastic case, 
the exact and approximate calculations. 
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1 Introduction 



Recent interest in network identification problems has been motivated by the 
advent of multi-electrode neural recordings and other large-scale biological 
data [U El EJ H] . Current inference methods, however, do not take into account 
the effects of units in the networks that are not recorded, though they are 
almost always present. This problem can be serious: For example, in cortical 
neural data, almost all recorded cells are excitatory, though inhibitory cells 
are essential in the network dynamics. In this paper we extend previous 
methodology to include "hidden units", presenting algorithms for inferring 
the strengths of connections to, from and among them. 

There is a long history of work of problems of this sort. Perhaps the best 
know is that on "Boltzmann machines" [5] . These are symmetrically coupled 
networks of stochastic binary units. Their states are updated, one randomly 
chosen unit at a time, with the probability of being in a particular one of 
its two possible states given by a logistic sigmoid function of the net input 
from other units. Because of the symmetric coupling matrix, their dynamics 
satisfies detailed balance, so their equilibrium distributions are of Gibbs- 
Boltzmann form Z _1 exp(— E), where E is a quadratic form. This fact that 
simplifies their analysis considerably. The problem has also been studied in 
networks where the unit outputs are continuous sigmoidal functions of their 
inputs, for both continuous-time (asynchronous-update) and discrete-time 
(simultaneous-update) dynamics, extending the back-propagation algorithm 
used earlier for layered networks. 

Applying either of these kinds of models to multineuron spike data is 
problematic. Real biological networks do not have symmetric connections, 
invalidating the first kind, while the nature of synaptic transmission and 
neuronal spiking calls for a stochastic binary representation, ruling out the 
second. In this paper we treat models in which the recorded neurons are 
stochastic and binary, and there is no symmetry requirement on the con- 
nections in the network. They obey a discrete-time kinetic Ising (Glauber) 
dynamics [BJ, and a value +1 represents an action potential. We study two 
kinds of models, in which the hidden units are deterministic or stochastic, re- 
spectively. We employ, for convenience, a discrete-time dynamics [TJ, though 
it should be straightforward to extend the treatment to continuous-time mod- 
els. 
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2 Continuous, deterministic hidden units 



We examine first the deterministic case, taking the output of a hidden unit 
to be a sigmoidal function of its input. Though it is a big simplification 
of a real spiking-neuron network, this kind of model can be practical for 
analyzing neural data. One cannot hope to model the detailed dynamics of 
all the unrecorded neurons in the network of interest, because they vastly 
outnumber the recorded ones. What one can hope to do, at least as a first 
approximation, is to describe the effect of unrecorded populations of neurons, 
for example, of inhibitory neurons when only excitatory neurons have been 
recorded. The values of the hidden units in our model here represent the 
firing rates of those populations. While this representation throws out many 
details, it enables one to capture some essential features of the dynamics, 
even using only one or a few hidden units. 

We draw here on work in learning in analog neural networks a couple 
decades ago, under the names "back-propagation in time" and "recurrent 
back-propagation" jSJ EJ [TUl E]. Our treatment differs from that work in 
having stochastic visible units and a likelihood-based objective function. 



We denote the states of the visible units by Si(t), where i labels the unit and 
t the time bin. They can take the values ±1. (We assume the recorded spikes 
have been sorted into time bins small enough that there is no more than one 
spike per bin.) We denote the hidden unit values by fJ, a (t), 1 < // (i) < 1. 
To make our equations a little more transparent, we use indices i, j, ■ ■ ■ 
for visible units and a, b, ■ ■ ■ for hidden ones. Our model is defined by the 
stochastic evolution rule 



2.1 Model 



P[ Si (t + l)\{s(t),v(t)}} 



//„(* + 1) 



exp[ Si (t + l)Hi(t)] 

2 cosh Hi(t) 
tanh B a {t), 



(1) 
(2) 



with 




(3) 
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All Si(t + 1) are assumed independent, conditional on {sj(t)}, {//&(£)}. The 
model is pictured in Fig. [TJ We do not write constant bias terms in H or B 
here; they can be included by adding input units which are always +1. We 
will denote the number of visible units by N v and the number of hidden ones 
by N h . 




Figure 1: Schematic picture of the model. (Color online) White squares 
represent visible units s^; blue ones, hidden units \i a (or a a when they are 
stochastic). Visible- visible connections Jy are black, hidden-to- visible ones 
K ib are red, visible-to-hidden ones L a j are blue, and hidden to hidden ones 
M ab are green. Rows represent time steps. 



2.2 Objective function and learning rules 

We assume we are given the data and that we know the number of 

hidden units. The task is to learn the connections {Jij}, {K ia }, {L a j}, and 
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{M ab }, and our objective function is the log likelihood of the observed visible 
history: 

£ = + l)Hi(t) - log 2 cosh Hi(t)]. (5) 

it 

We consider the simplest form of gradient-based learning, where the pa- 
rameters are adjusted proportional to the derivative of the log likelihood with 
respect to them. For {Jij} and {K ia }, this is straightforward: 

AJ« = Vh(t+l)-tanh(^(t))]^^ = Ve fc (t + l) ai (t), (6) 
A^ fc6 = V[s i (t+l)-taoh( J H i (t))]^^ = y;e Jfc (t+l)^(t), (7) 

z — OA tfe — ' 

if K0 t 

with €k(t + l) = Si(t) — tanh.ffj(t), the observed error on unit i at i + 1 under 
the model with the current parameters, given its state at t. This is standard 
error x input learning, as in networks without hidden units (HI |3] . 

For the connections that lead to hidden units, the derivatives of 
with respect to {L a j} and {M ab } are through its dependence on the /x&(t); as 
in 



<9L 



<9L 



(8) 



«/ 



Furthermore, the derivatives of the fij(t) have terms proportional to deriva- 
tives of all the fis at the previous time step: 



dL 



al 



(9) 



These equations can be iterated starting from the initial condition <9/z c (0) / dL a i 
0. The solution can be written relatively compactly: 



dL a i 
where 



t-i 



X bb (t) \ 5 abSl (t - 1) + 

q=l 



r=l 



Sl {t-q-l) 



ba 



X ab (t) = {I - £{t))5 ab 



(10) 
(11) 



5 



and we make the convention that the product over r is equal to 1 when q = 0. 
The learning rule for L a i can then be written as 

t-i 

t g=0 i 

(12) 

Exactly the same procedure for the derivative with respect to M ab gives 
t-i 

t q=0 i 

(13) 

which differs from (Tl2|) only in the last factor. 

This all has a nice graphical interpretation. The effective error is the sum 
over all paths starting at future visible units (time t + 2 + q) and propagating 
back through the hidden units at intermediate times until it reaches the 
receiving unit a at time t+1. For each such path, we pick up a factor e«(t + 
q + 2) at the visible error source, a factor of a for backpropagating from 
the source unit to a hidden unit b, factors of elements of M for the hidden- 
to-hidden connections on the path, and factors of X cc = 1 — [i 2 c at every 
hidden unit c that it passes through. This is just the standard prescription 
for back-propagation of errors in layered networks. Fig. [2] shows a typical 
path for q = 2. 

2.3 Numerical results 

In this calculations reported in this paper we restrict ourselves to networks 
with no hidden-to-hidden connections (M = 0). This simplifies the learning 
algorithm considerably: There are no backpropagation paths longer than 
two steps. Fig. E] shows an example of learning for a network with 18 
visible and 2 hidden units, based on 10000 time steps of data. The top left 
panel shows how the cost function (the negative log-likelihood of the data) 
falls smoothly to a minimum. The top right panel shows the evolution of 
the errors in the couplings Jy, Kyg and L a j under learning. The apparent 
poor performance can be understood by comparing the middle panels, which 
show the coupling matrix elements of the model that generated the data 
(left) and the inferred couplings (right), respectively. It is apparent that the 
input connection strengths L2j to the second hidden unit (unit 20 in these 



KX(t + 1 + q) J] MX(t + 



si(t), 



vr=l 
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/+4 HJ f + 2 f+1 | 




Figure 2: Back-propagation of errors from the future through the hidden 
units. The example path here starts at a visible unit i where the output 
error €i(t + 4) is measured. It is then propagated back in time, first to a 
hidden unit at time t + 3, then through another hidden unit at t + 2 and 
finally to the one at t + 1 which is the receiving unit on the connection being 
evaluated. It gives a change in that connection strength equal to the product 
of €i(t + 4), all the connection strengths on the path, and factors of 1 — ^lif) 
for each hidden unit on the path. The total connection strength change is a 
sum over all such paths from all visible units in the future. 
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plots) are negatives of each other in the two panels. The same is true of the 
outgoing connections K i2 from that unit, though it is hard to see in these 
graphs. These two inversions have no effect on the visible units, so the "true" 
model and the one with the flipped signs of L 2 j and K i2 are equivalent: there 
is no way we can know from the visible data alone which one was the true 
model. The bottom left panel shows how, if we bias the initial random values 
of the couplings to have the right sign, results close to the true model are 
obtained. The equivalence of the two inferred models is apparent from the 
fact (bottom right panel) that the final values of the cost function are exactly 
the same. 

In this example it was easy to see the relation between the inferred and 
true connections. However, in general there is a 2 Nh x A^l-fold degeneracy 
(the signs of the connections to and from every hidden unit could be flipped, 
and the labels on the hidden units can be permuted arbitrarily.) Thus, for 
large Nh, most likely one will infer one of the models equivalent to the true 
one, but not the true one itself. 



3 Stochastic hidden units 

The case where all units in the model, including the hidden ones, are stochas- 
tic is more difficult, but it is the more interesting one from a theoretical point 
of view. Denoting the hidden units by cr a (t), the dynamics are now given by 

exp[ Si (t + l)Hi(t)} exp[a a (t + l)B a {t)} 



P[ Si (t + l),a a (t + l)]{s(t),a(t)}\ = 
with 



2coshif i (f) 2 cosh B a (t) 

(14) 



Hi(t) = JiMt) + ^2 K ib°b(t) (15) 

3 b 

B a {t) = Y, L *3 S 3it)+^ M ab<Jb{t). (16) 
3 b 

We restrict our treatment to networks with weak dense random connections, 
J ij: L aj = 0(l/y/TQ, K tb ,M ab = 0(1/ y/TQ, so that H^t) and B a (t) are of 
order 1. 

The likelihood of the history of the full system is 

P[s, a] = H P[ Si (t + 1), a a {t + l)]{s(t), <r(f)}], (17) 

tia 
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50 100 150 200 250 300 350 400 50 100 150 200 250 300 350 400 



Figure 3: Learning example: network of 18 visible and 2 hidden units (no 
hidden-hidden connections). Top left: iterative minimization of the cost 
function —C. Top right: rms errors on Jy, K^, and L a j as functions of the 
number of iterations of the learning algorithm when it is started at small 
random values of the couplings. Middle panels: true (left) and inferred 
coupling strengths. The hidden units are number 19 and number 20. Bottom 
panels: rms errors (left) and cost function (right) when the initial parameter 
values have the correct signs. 
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and the likelihood of the visible history is 



P[a] = 5>[*,4 



(18) 



The distribution of the a, conditional on the observed data, is 

P[s,a] 



Pas] 



P[s] ■ 



(19) 



This has the form of a Gibbs distribution Z s 1 exp(— E s [a]), with 

E s [a}= log P[s, a] (20) 

and 



z s = p[s\ = Y,p[sM- 



(21) 



(Z s also depends on all the model parameters {Jij, L a j, M^}, but to 
save some space we do not write that explicitly.) To show the nature of the 
interactions in the energy -E^cr], we write it out explicitly: 

t \ ij ib 



0-3 



log 2 cosh 

i 

log 2 cosh 



^^•(0+^^6(76^) 

^L a ,s,(t) + ^M ab a 6 (t) 



(22) 



The first term is just a constant (independent of the as), the next two are 
like external fields acting on the cr a (t) from the visible data Si(t ± 1) one 
time step in the future and past, respectively, and the fourth term represents 
interactions between as at successive time steps. The final two terms are 
interactions among all the as at one time (but these terms do not couple as 
at different times). Their non-polynomial form leads to important features 
in this problem that are not present in Boltzmann machines. 
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3.1 Exact learning algorithm 



Just as for Boltzmann machines, we can derive an exact learning algorithm 
for the model parameters by gradient ascent on Z s , the log likelihood of the 
visible history. It can be written 

d lo Z 

AJ i:j ex °f s = J2[si{t + 1) - (tanhff^))^]^) (23) 

v j t 

AKlb K ^^ = Z)([ s i(* + 1 )- tanh ^(*)]^(*))^ ( 24 ) 

lb t 

AL aj ex dl °^ Zs =J2M t + 1 )~ ttmhB a (t)) alsSj (t) (25) 
AM ab oc 9l ° g Zs = ^2([<x b (t + 1) - tanhB a (t)]tr h (t)) g | 8 (26) 



a6 



The averages (• • • ) CT | S are over all hidden histories cr(t), weighted by the proba- 
bility P[c|s] = Z~ l exp{— £? s [o"]} that they produce the known visible history 
s. In each learning rule, the first term comes from differentiating the terms 
in the first two lines of ( 122|) and the second from differentiating one of the 
log 2 cosh terms. 

When there are no hidden-to-hidden connections M a b, P[cr\s] becomes a 
product of independent terms, one for each t. The averages over P[cr|s] in 
f f23ll26|) then involve sums over 2 Nh terms, where Nh is the number of hidden 
units. For small networks, they can be computed exactly in a reasonable 
time. 



3.2 Numerical results 

In Fig. H]we show, for a model with Nh = N v = 10 (and, again, no hidden- 
to- hidden connections), how the log-likelihood of the data converges to its 
asymptotic value as the number of steps in the data set is increased. All 
the couplings in this example were i.i.d. and normal with variance 0.1. In 
addition to the cost function — £ evaluated on the training data, we also plot 
it evaluated on an independently-generated test data set. We also plot the 
values of the Akaike and Bayesian information criteria, based on the training 
cost function. The Akaike information criterion penalizes the estimated log 
likelihood (i.e., increases the cost) by the number of parameters N, and the 
Bayesian information criterion penalizes it by N log N. 
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10 1 10 2 1 3 1 4 

T 

Figure 4: Cost functions for learning in a network of 10 visible and 10 hidden 
units. (Color online) Blue: evaluated on training data. Green: evaluated on 
independent test data. Red: Akaike information criterion (AIC [12J). Cyan: 
Bayesian information criterion (BIC [13]). Purple: T — > oo limiting value. 
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For networks larger than ~ 10, one has to resort to Monte Carlo to 
estimate the averages. When there are hidden-to-hidden connections, the 
number of states to sum over becomes 2 NhT , where T is the number of time 
steps in the data. In this case, exact calculations are never possible, even 
for just one hidden unit, and even Monte Carlo becomes impractical for 
moderate numbers of hidden units. 



4 Mean field theory for stochastic hidden units 

An attractive approximate alternative is mean field theory. It can be for- 
mulated variationally [TJ]: One seeks the best approximation to f that 
factorizes over the different cr a (t). Each such factor is parametrized by a 
single number: the probability that a a (t) = +1. Equivalently (and con- 
ventionally), one can use the "magnetization", denoted /U a (t), which is the 
difference between the probabilities to be +1 and —1. The entire factoriz- 
able distribution is then parametrized by the set of magnetizations {fi a (t)}. 
The learning proceeds in an EM fashion [T51 HE] , iterating the two steps: (1) 
For given coupling parameters, find the /x a (t) that maximize the factorized 
\ogZ s , and (2), for these /i a (£), improve the estimates of the coupling param- 
eters as in rules (I23H26I) but with the averages computed under the factorized 
approximate P[a|s]. 



4.1 Derivation of mean-field theory 

Under the factorizability assumption, the likelihood of the visible data {si(t)}, 
given (<7 (t)) = fi a (t), is 

Pmf[v, s] = exp{S[fj] - E s [/i\} = exp { J i} , K ib , L aj , M ab }}, (27) 

where 



(28) 

is the entropy: the average log of the probability of magnetizations fj, a {t)- In 
f )27|) we indicate explicitly that A depends on the parameters { Jy, K ib , L a j, M ab } 
(through Eg). Thus, the EM learning procedure involves repeatedly maxi- 
mizing over /j, for fixed parameters (the "E-step" ) and taking uphill steps on 
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A (equivalently, downhill steps on E s ) in parameter space for fixed « (the 
"M-step"). 

The prescription for obtaining the average energy by the replacement 
&a{t) —> ^a(t) in E s is based on the independence of different a a {t) under the 
factorized distribution. For example, if E s contains a term like (T a (t)ab{t), 
then {a a {t)a b {t) = (cr a (t)) (<7{,(t)) = fi a (t)fi b (t). Thus, one might think that 
we get the E s \p] to use in f[2"T|) by simply substituting /i for a in f[2"2"j) . Then 
maximizing A\p] would lead to the equations 



for the Ha{t)- This equation has a nice interpretation: The first two terms are 
just the inputs from visible and hidden units, respectively, at the previous 
time step, and the last two terms are just the back-propagated errors from 
visible and hidden units one time step later. 

However appealing this equation looks, it is wrong. One has to be careful 
in the log 2 cosh terms in Expanding it in powers of the K ib , we get a 

second order term proportional to J2ab KiaKib&a&b- The double sum includes 
terms with a = b, and for these terms we should make the replacement 
al = 1, not <j\ = ji 2 a . (This situation does not arise for the usual Ising 
energy —J2i<jJij s i s j> since the i — j term is explicitly excluded from the 
sum. ) The same problem comes up in all higher-order terms in the expansion 
whenever there are repeated indices in the sums over hidden unit indices. 
This problem was noticed already by Saul et al [IT], who tried to deal with 
it by introducing an extra set of variational parameters. Here, we make 
a treatment for a particular ensemble of models that is exact (within the 
factorization approximation) in the limit of large N^. In these models, all 
the couplings are zero-mean independent random numbers with variances 
proportional to 1/N. This makes the net inputs Hi(t) and B a (t) Gaussian 
(for large N), with variances of order unity. 

Writing the nth order term in the expansion of log 2 cosh H (we drop the 
visible unit index % temporarily here, for simplicity) as 



tanh fJL a (t) = Y.j L ajSj(t - 1) + Y,b M abHb(t - 1) 
+ J2i\ s i( t + 1 ) - tanh JijBj(t) + Y,b K ibHb(t) | K ia 

+ J2 b {»b(t + 1) - tanh [£\ L bjSj (t) + £ c M bc u c (t)] } M ba 



(29) 




(30) 
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consider first the terms in every term of (I30p where two of the indices are 
equal. There are n(n — l)/2 such pairs, so the correction to this subset of 
the nth order terms is 



^( 2 ) = I C J1 V K ■■■K a ■■■a 

ai,a,2,---a n -2 



(31) 

because naive substitution of fi a for a a would have given /z 2 instead of 1. 
But what multiplies the sum on a here is just half the n — 2nd term in the 
expansion of the second derivative of log 2 cosh if, i.e., 1 — tanh 2 H. So we 
can sum all such terms over n, yielding a correction 

£ 2 = |(l-tanh 2 ii)£^ 2 (l-u 2 ). (32) 

a 

Thus, at this level of approximation, we should use an energy E s [fi] in which 
^ i log2coshif i (t) is replaced by 

£log2cosh^(t) + §£[1 -tann 2 ^)]^ - u 2 (t)] (33) 

i ia 

(now with the substitution a — > fi in the first term). This looks like the TAP 
term in the free energy for the usual Ising model, but with the opposite sign. 
The same argument applies to the log 2 cosh B term in E s , which should be 
replaced by 

Elog2cosh J B a (t) +i^[l-tanh 2 B a (t)]M^[l-^(i)]. (34) 

a ab 

These corrections will lead to new terms in the MF equations for /i a (t) and 
in the learning rule for the K ia and M^. Note also that, for the models we 
are considering, these correction terms are of order 1 (per visible or hidden 
unit, respectively) since they contain sums of Nh terms and each term is of 
order 1/N h . 

We can also sum up terms with 2 pairs of indices equal, 3 pairs of indices, 
equal, etc. Consider first the terms where two pairs of indices are equal. In 
the nth order term a n fl30l) . there are n!/[4!(n — 4)!] ways of picking the 4 
indices and 3 ways to pair them. The correction is 

- 2 



(4) _ ^ C n \ ^ jy- jy- 

7 « ~4!( n -4)! ^ 1 "'^ a ^ a ^ 

ai,a3,---a n -4 



(35) 
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The sum over n of these terms is just 



3 <9 4 (log 2 coshi/) 



4! 



dH 4 



Ha 



(36) 



Like 032]), 035]) is of order 1. 

Extending this argument to the general term with j/2 pairs of coincident 
indices, in the nth order term, there are n\/[j\(n — ways to pick our the j 
indices, and the number of ways to pair them is (j — 1)!! = (j — l)(j— 3) ■ ■ • 3-1. 
Thus, we get a correction 



(j - l)\\d j (log 2 cosh#) 



(37) 



Again, all these terms are all of order 1. 

On the other hand, terms we have not considered, with 3 or more indices 
equal, are negligible in the mean-field limit Nh — > oo. (Consider terms with 
p equal indices. They involve the sum K%, which is of order N^~ p ^ 2 and 
therefore negligible for p > 2 as — > oo. 

Now we can sum all the Ej over j, exploiting the fact that (j — 1)!! is the 
jth moment of a zero-mean univariate normal distribution. The result of all 
these manipulations is simply the replacement 



log 2 cosh Hi(t) 



Dxlog2cosh{H i (t) + Ai(t)x], 



where Dx means (2tt) 1,/2 e x2 l 2 dx and 



^{t) = Y, K i^-A{t)\- 



(38) 



(39) 



Thus, the effect of all these corrections can be described in terms of an 
effective Gaussian noise. The same arguments apply to the log 2 cosh B term, 
with the final result that the effective energy can be written, exactly in the 



16 



limit iV/j — > oo, as 



E s [tA 



with 



+ ^// a (t + l)L aj s,(t) +^)/i B (t + l)M ab fx b (t) 

aj b 

- J2 Dx log 2 cosh Ji^A 1 ) + J2 K ^b{t) + 

i J Li b 

- I D y lo s 2 cosh Z L ^ s At) + Z M »&M*) + r (*)y 

a ^ Li 6 

r^) = EM a 2 ji-^ (t)] . 



>40) 



(41) 



We note that this form could have been motivated heuristically: In ffTBT) 
and (fT6l) . the <T;,(t) are fluctuating variables of variance 1 — ulit). Since 
and M ab are assumed to be independent random variables, Hi(t) and B a (t) 
are normally distributed with variances A 2 (t) and r 2 (t) given by (1391) and 
TT]), respectively. 



4.2 Learning algorithm 

The resulting equations for the E-step are then 

tanh -1 fx a (t) = J2j L ajSj(t - 1) + Y,b M ab^b{t ~ 1) 
+ I2i{si(t + 1)- fDx tanh ViC*) + E 6 # ft /i 6 (*) + A^x] } K ia 

+MaEi j 1 " I Dx tanh 2 ^.(t) + Eft^&M&W + A<(t)z] } K 2 Q 
+ Eft + 1) - / £>ytanh [E,- L 6i Si(t) + E c M fec/ u c (t) + T b (t)yj } M 6a 
+ftE 6 {l - / Dy tanh 2 . L bjSj (t) + E c M bc/ u c (t) + T b (t)y\ } M 2 a .(42) 

They differ from the naive equations (1291) in that the tanh terms in the second 
and fourth lines are averaged over the Gaussian noises and in the presence 
of the new terms on the third and fifth lines. The latter have the form of 
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cavity field corrections [18]: The effect of \i a itself on the expected Si(t + 1) 
and Hb(t + 1) should not be counted in calculating the tanh if terms in the 
second and fourth lines. 

For the M-step, the learning rules for and L a j are 



AJ - K d.J l3 

dE s 
" ~dL- 



+ 1) - J Dxt&nh [H t (t) + A<(t)a:] J Sj(*X43) 
= J] {m* + 1) - y ^ tanh [B a (t) + r a {t)y)} Sj ($£) 



differing from those we would find in the naive mean field theory only in the 
averaging of the tanh's over the Gaussian noises. The rules for K ib and M ab , 

dE s 



AK ih oc 



dK, 



| \Si(t + 1) - J Da; tanh [H t (t) + A< 



(t)x] fl b (t) 



Dxtfmh 2 [Hi(t) + Ai(t)x] 



K ib [l-fx 2 b (t)\ 



(45) 



AM nh oc 



dM, 



ab 



| [Va(t + 1) - J Dy tanh [B a {t) + T a (t)y]j /i b (t) 



Dyta,nh 2 [B a {t)+T a {t)y] 



M ab [l-i4(t)\ 



(46) 



have extra terms that come from the dependence of Aj(i) and T a (t) on Ki a 
and M ab in ( 1391) and (l4"Ij) . respectively. 

For small i£j a and M afe (i.e., at the level of the corrections fl33l) and fl34|) . 
the E-step equations reduce to 

tanh" 1 /i a (t) = J2L ajSj (t- 1) + ^M a6 /x 6 (t-l) 

+ X) + ^ ~ tanh AT ia + [1 - tanh 2 Hi{t)]Klii a {t) 

i 

+ tanh ^(t)[l - tanh 2 Hi{t)]K ia J2 K ?b[i- -/4(f)] 
+ i ^ + ^ ~ tanh B ^)\ M ba + [1 - tanh 2 5 b (t)]M 2 a /i a (t) 



tanh B 6 (t)[l - tanh 2 B b (t)]M ba £ M^[l - /i 2 (t)] 



(47) 
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and the learning rules are 






oc 


V{si(t + 1) -tanh#i(t)[l - (1 - 

t 






oc 


53{Ma(* + l)-taohS (t)[l-(l- 
t 


-tanh 2 B a (t))r o (t)]} Si «H9) 




oc 


-tanhifi(t)[l - (1 - 


-tanh 2 if i (t))A i (t)])/i 6 (t) 






[l-tanh 2 i^fl-// 2 ^)]} 


(50) 


AM a6 


oc 


£{(/i«(t + l)-tanhfl a (t)[l-(l 
t 


-tanh 2 B n (t))r a (t)])^(t) 






[1-tanh 2 £ a (i)]M o6 [l -nl{t)]} 


(51) 



A few final remarks are in order. The reader might notice that the lowest- 
order corrections in (133]) . (|34p . and (H7j) resemble Thouless- Anderson-Palmer 
(TAP) corrections in spin glasses [19J. However, there the TAP equations 
come from the first corrections to the factorized-distribution approximation, 
whereas ours here come from evaluating the average energy within that ap- 
proximation. We expect that for our model here, as for spin glasses, to get 
an exact theory for large N^, TAP corrections analogous to theirs should 
also be included. We do not try to do that here, working entirely within the 
factorized-distribution ansatz. In problems like ours for networks without 
hidden units, this is sometimes called "naive mean field theory" [3]. 

4.3 Numerical results 

We have carried out mean-field inference computations for some models with 
no hidden-to-hidden connections (M a & = 0), using the lowest-order mean- 
field equations (14711511) . Fig. [S] shows how the mean square errors of Jy, K ib 
and L a j depend on the data set length T for two networks with 80 visible 
units. The left-hand panel shows the case where the number of hidden units 
Nh = 80, and the right-had panel shows the case where Nh = 20. For the 
smaller Nh, all three mean square errors fall off like 1/T, as we would expect 
to find if we could do this calculation exactly. However, for the larger iV^, 
while the errors on the visible-to- visible couplings also fall off with T in this 
way, the errors on the couplings to and from the hidden units are larger and 
fall off much more slowly. 
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Figure 5: Mean square errors on Js (blue), Ks (green) and Ls (red) computed 
in mean field theory as functions of data set length. (Color online) Left panel: 
N v = N h = 80. Right panel: Nv = 80, N h = 20. All couplings are i.i.d. 
normal, with variance 1/N V for and L a j and 1/Nh for 




Figure 6: Mean square errors on Js (blue), i^s (green) and Ls (red) as 
functions of data set length for small networks. (Color online) Left Panel: 
N v = Nh = 5. Right Panel: N v = Nh = 8. Mean-field results are solid lines; 
exact results are dashed. Couplings chosen as in Fig. 5. 



20 



We can get a little insight into this behavior by doing the mean-field cal- 
culations for small N h , where it is also possible to do the exact calculations, as 
described in Sect. 2.3. Fig. [6] shows the results of both kinds of calculations 
for N v = Nh = 5 and 8. In these cases we can see that for small T the mean- 
field and exact calculations nearly coincide. The T-dependence is in this 
region is qualitatively like that for the mean-field results at N v = = 80. 
However, at larger T, the mean-field errors all fall less rapidly. At the same 
time, the exact calculation gives errors on the Js which continue to fall off 
like 1/T, and those on the Ks and Ls also start to fall more rapidly at the 
largest Ts studied. This behavior is consistent with the expectation that, 
as for models with no hidden units, all exact-method errors should fall off 
asymptotically like 1/T, while the mean- field errors should approach lim- 
its oc l/Nh [3]. However, apparently one has to go to very large data sets 
(roughly T > 10 3 Nh) to see this. 

5 Discussion 

We have derived learning rules for two kinds of stochastic binary networks 
with hidden units. These networks differ from Boltzmann machines in that 
(1) the units in them are updated synchronously rather than asynchronously, 
and (2) the connection strengths are allowed to be asymmetric. Because of 
these differences, the usual kind of Gibbs equilibrium does not hold, and a 
new kind of treatment is required. 

The first kind of network has deterministic, continuous-valued hidden 
units. The learning rules for it are very similar to those in the back-propagation- 
in-time approach for recurrent networks where all the units are deterministic 
and continuous-valued. 

Units in the second kind of network are binary and stochastic, like the 
visible units. Here the learning problem is harder, but we have showed that 
one can always put it into the form of an equilibrium statistical mechanical 
problem with a non-polynomial energy function. The learning rules involve 
averages over the Gibbs distribution for this problem. For small systems 
and in the absence of hidden-to-hidden couplings, the problem can be solved 
exactly numerically, but otherwise one must resort to Monte Carlo methods 
or other approximations. We explored in detail one such approximation: 
mean field theory. A careful analysis revealed that the naive way one might 
write this theory was wrong, but we were able to construct a version of mean 
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field theory that was exact for weak, dense connectivity in the limit of a large 
number of hidden units (the analog of the Sherrington-Kirkpatrick model of 
spin glasses [20J). 

We also performed some numerical calculations to illustrate and to begin 
to explore some of the features of the different kinds of networks and learning 
rules. A general feature is that when the number of hidden units is large 
(i.e., comparable to the number of visible units), the errors in determining 
the couplings to and from the hidden units are much larger than those on the 
couplings among the visible units. This is true for both kinds of networks and 
for both exact learning algorithms and mean field theory. This should not be 
surprising, since the information about the connections to and from hidden 
units is only available indirectly, through the statistics of the visible units. 
On the other hand, it is noteworthy that even a rather poor estimation of the 
connections to and from the hidden units does not spoil the good estimation 
of the couplings among the visible ones. 

Another point worth mentioning is that for small data lengths mean field 
theory is as good as doing the full exact calculation, which would take pro- 
hibitively long for Nh much bigger than 10 or so. For large N^ the errors on 
connections to and from hidden units can be rather large and fall off very 
slowly with T, but the results on small systems seem to show that doing 
the exact calculation instead of mean field theory (even if this were feasible), 
would not help except at very large T. 

We have only scratched the surface of this problem in our numerical 
calculations. It would be useful to know, for example, what the asymptotic 
errors on the Ks and Ls are for the mean-field algorithm in the limit of large 
data sets and at what T the approach to these values begins, as functions of 
Nh and N v . We leave this and other questions to future work. The theory 
presented here provides a foundation for those investigations and, we hope, 
will point the way toward other questions that will be interesting to study. 
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